{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/node_postprocessor/MetadataReplacementDemo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Metadata Replacement + Node Sentence Window\n",
    "\n",
    "In this notebook, we use the `SentenceWindowNodeParser` to parse documents into single sentences per node. Each node also contains a \"window\" with the sentences on either side of the node sentence.\n",
    "\n",
    "Then, after retrieval, before passing the retrieved sentences to the LLM, the single sentences are replaced with a window containing the surrounding sentences using the `MetadataReplacementNodePostProcessor`.\n",
    "\n",
    "This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.\n",
    "\n",
    "By default, the sentence window is 5 sentences on either side of the original sentence.\n",
    "\n",
    "In this case, chunk size settings are not used, in favor of following the window settings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-embeddings-openai\n",
    "%pip install llama-index-embeddings-huggingface\n",
    "%pip install llama-index-llms-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"OPENAI_API_KEY\"] = \"sk-...\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "from llama_index.embeddings.huggingface import HuggingFaceEmbedding\n",
    "from llama_index.core.node_parser import SentenceWindowNodeParser\n",
    "from llama_index.core.node_parser import SentenceSplitter\n",
    "\n",
    "# create the sentence window node parser w/ default settings\n",
    "node_parser = SentenceWindowNodeParser.from_defaults(\n",
    "    window_size=3,\n",
    "    window_metadata_key=\"window\",\n",
    "    original_text_metadata_key=\"original_text\",\n",
    ")\n",
    "\n",
    "# base node parser is a sentence splitter\n",
    "text_splitter = SentenceSplitter()\n",
    "\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\", temperature=0.1)\n",
    "embed_model = HuggingFaceEmbedding(\n",
    "    model_name=\"sentence-transformers/all-mpnet-base-v2\", max_length=512\n",
    ")\n",
    "\n",
    "from llama_index.core import Settings\n",
    "\n",
    "Settings.llm = llm\n",
    "Settings.embed_model = embed_model\n",
    "Settings.text_splitter = text_splitter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Data, Build the Index\n",
    "\n",
    "In this section, we load data and build the vector index."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Data\n",
    "\n",
    "Here, we build an index using chapter 3 of the recent IPCC climate report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: www..ch\n"
     ]
    }
   ],
   "source": [
    "!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import SimpleDirectoryReader\n",
    "\n",
    "documents = SimpleDirectoryReader(\n",
    "    input_files=[\"./IPCC_AR6_WGII_Chapter03.pdf\"]\n",
    ").load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extract Nodes\n",
    "\n",
    "We extract out the set of nodes that will be stored in the VectorIndex. This includes both the nodes with the sentence window parser, as well as the \"base\" nodes extracted using the standard parser."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nodes = node_parser.get_nodes_from_documents(documents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_nodes = text_splitter.get_nodes_from_documents(documents)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build the Indexes\n",
    "\n",
    "We build both the sentence index, as well as the \"base\" index (with default chunk sizes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import VectorStoreIndex\n",
    "\n",
    "sentence_index = VectorStoreIndex(nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_index = VectorStoreIndex(base_nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying\n",
    "\n",
    "### With MetadataReplacementPostProcessor\n",
    "\n",
    "Here, we now use the `MetadataReplacementPostProcessor` to replace the sentence in each node with it's surrounding context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There is low confidence in the quantification of Atlantic Meridional Overturning Circulation (AMOC) changes in the 20th century due to low agreement in quantitative reconstructed and simulated trends. Additionally, direct observational records since the mid-2000s remain too short to determine the relative contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is very likely that AMOC will decline for all SSP scenarios over the 21st century, but it will not involve an abrupt collapse before 2100.\n"
     ]
    }
   ],
   "source": [
    "from llama_index.core.postprocessor import MetadataReplacementPostProcessor\n",
    "\n",
    "query_engine = sentence_index.as_query_engine(\n",
    "    similarity_top_k=2,\n",
    "    # the target key defaults to `window` to match the node_parser's default\n",
    "    node_postprocessors=[\n",
    "        MetadataReplacementPostProcessor(target_metadata_key=\"window\")\n",
    "    ],\n",
    ")\n",
    "window_response = query_engine.query(\n",
    "    \"What are the concerns surrounding the AMOC?\"\n",
    ")\n",
    "print(window_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Window: Nevertheless, projected future annual cumulative upwelling wind \n",
      "changes at most locations and seasons remain within ±10–20% of \n",
      "present-day values (medium confidence) (WGI AR6 Section  9.2.3.5; \n",
      "Fox-Kemper et al., 2021).\n",
      " Continuous observation of the Atlantic meridional overturning \n",
      "circulation (AMOC) has improved the understanding of its variability \n",
      "(Frajka-Williams et  al., 2019), but there is low confidence in the \n",
      "quantification of AMOC changes in the 20th century because of low \n",
      "agreement in quantitative reconstructed and simulated trends (WGI \n",
      "AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). \n",
      " Direct observational records since the mid-2000s remain too short to \n",
      "determine the relative contributions of internal variability, natural \n",
      "forcing and anthropogenic forcing to AMOC change (high confidence) \n",
      "(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., \n",
      "2021).  Over the 21st century, AMOC will very likely decline for all SSP \n",
      "scenarios but will not involve an abrupt collapse before 2100 (WGI \n",
      "AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).\n",
      " 3.2.2.4 Sea Ice Changes\n",
      "Sea ice is a key driver of polar marine life, hosting unique ecosystems \n",
      "and affecting diverse marine organisms and food webs through its \n",
      "impact on light penetration and supplies of nutrients and organic \n",
      "matter (Arrigo, 2014).  Since the late 1970s, Arctic sea ice area has \n",
      "decreased for all months, with an estimated decrease of 2 million km2 \n",
      "(or 25%) for summer sea ice (averaged for August, September and \n",
      "October) in 2010–2019 as compared with 1979–1988 (WGI AR6 \n",
      "Section 9.3.1.1; Fox-Kemper et al., 2021). \n",
      "------------------\n",
      "Original Sentence: Over the 21st century, AMOC will very likely decline for all SSP \n",
      "scenarios but will not involve an abrupt collapse before 2100 (WGI \n",
      "AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).\n"
     ]
    }
   ],
   "source": [
    "window = window_response.source_nodes[0].node.metadata[\"window\"]\n",
    "sentence = window_response.source_nodes[0].node.metadata[\"original_text\"]\n",
    "\n",
    "print(f\"Window: {window}\")\n",
    "print(\"------------------\")\n",
    "print(f\"Original Sentence: {sentence}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Contrast with normal VectorStoreIndex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The concerns surrounding the AMOC are not provided in the given context information.\n"
     ]
    }
   ],
   "source": [
    "query_engine = base_index.as_query_engine(similarity_top_k=2)\n",
    "vector_response = query_engine.query(\n",
    "    \"What are the concerns surrounding the AMOC?\"\n",
    ")\n",
    "print(vector_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens compared to the sentence window index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation). The context information mentions that the AMOC will decline over the 21st century, with high confidence but low confidence for quantitative projections.\n"
     ]
    }
   ],
   "source": [
    "query_engine = base_index.as_query_engine(similarity_top_k=5)\n",
    "vector_response = query_engine.query(\n",
    "    \"What are the concerns surrounding the AMOC?\"\n",
    ")\n",
    "print(vector_response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis\n",
    "\n",
    "So the `SentenceWindowNodeParser` + `MetadataReplacementNodePostProcessor` combo is the clear winner here. But why?\n",
    "\n",
    "Embeddings at a sentence level seem to capture more fine-grained details, like the word `AMOC`.\n",
    "\n",
    "We can also compare the retrieved chunks for each index!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Over the 21st century, AMOC will very likely decline for all SSP \n",
      "scenarios but will not involve an abrupt collapse before 2100 (WGI \n",
      "AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).\n",
      "\n",
      "--------\n",
      "Direct observational records since the mid-2000s remain too short to \n",
      "determine the relative contributions of internal variability, natural \n",
      "forcing and anthropogenic forcing to AMOC change (high confidence) \n",
      "(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., \n",
      "2021). \n",
      "--------\n"
     ]
    }
   ],
   "source": [
    "for source_node in window_response.source_nodes:\n",
    "    print(source_node.node.metadata[\"original_text\"])\n",
    "    print(\"--------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's try and disect why the naive vector index failed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AMOC mentioned? False\n",
      "--------\n",
      "AMOC mentioned? False\n",
      "--------\n",
      "AMOC mentioned? True\n",
      "--------\n",
      "AMOC mentioned? False\n",
      "--------\n",
      "AMOC mentioned? False\n",
      "--------\n"
     ]
    }
   ],
   "source": [
    "for node in vector_response.source_nodes:\n",
    "    print(\"AMOC mentioned?\", \"AMOC\" in node.node.text)\n",
    "    print(\"--------\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So source node at index [2] mentions AMOC, but what did this text actually look like?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2021; Gulev et al. \n",
      "2021)The AMOC will decline over the 21st century \n",
      "(high confidence, but low confidence for \n",
      "quantitative projections).4.3.2.3, 9.2.3 (Fox-Kemper \n",
      "et al. 2021; Lee et al. \n",
      "2021)\n",
      "Sea ice\n",
      "Arctic sea ice \n",
      "changes‘Current Arctic sea ice coverage levels are the \n",
      "lowest since at least 1850 for both annual mean \n",
      "and late-summer values (high confidence).’2.3.2.1, 9.3.1 (Fox-Kemper \n",
      "et al. 2021; Gulev et al. \n",
      "2021)‘The Arctic will become practically ice-free in \n",
      "September by the end of the 21st century under \n",
      "SSP2-4.5, SSP3-7.0 and SSP5-8.5[…](high \n",
      "confidence).’4.3.2.1, 9.3.1 (Fox-Kemper \n",
      "et al. 2021; Lee et al. \n",
      "2021)\n",
      "Antarctic sea ice \n",
      "changesThere is no global significant trend in \n",
      "Antarctic sea ice area from 1979 to 2020 (high \n",
      "confidence).2.3.2.1, 9.3.2 (Fox-Kemper \n",
      "et al. 2021; Gulev et al. \n",
      "2021)There is low confidence in model simulations of \n",
      "future Antarctic sea ice.9.3.2 (Fox-Kemper et al. \n",
      "2021)\n",
      "Ocean chemistry\n",
      "Changes in salinityThe ‘large-scale, near-surface salinity contrasts \n",
      "have intensified since at least 1950 […] \n",
      "(virtually certain).’2.3.3.2, 9.2.2.2 \n",
      "(Fox-Kemper et al. 2021; \n",
      "Gulev et al. 2021)‘Fresh ocean regions will continue to get fresher \n",
      "and salty ocean regions will continue to get \n",
      "saltier in the 21st century (medium confidence).’9.2.2.2 (Fox-Kemper et al. \n",
      "2021)\n",
      "Ocean acidificationOcean surface pH has declined globally over the \n",
      "past four decades (virtually certain).2.3.3.5, 5.3.2.2 (Canadell \n",
      "et al. 2021; Gulev et al. \n",
      "2021)Ocean surface pH will continue to decrease \n",
      "‘through the 21st century, except for the \n",
      "lower-emission scenarios SSP1-1.9 and SSP1-2.6 \n",
      "[…] (high confidence).’4.3.2.5, 4.5.2.2, 5.3.4.1 \n",
      "(Lee et al. 2021; Canadell \n",
      "et al. 2021)\n",
      "Ocean \n",
      "deoxygenationDeoxygenation has occurred in most open \n",
      "ocean regions since the mid-20th century (high \n",
      "confidence).2.3.3.6, 5.3.3.2 (Canadell \n",
      "et al. 2021; Gulev et al. \n",
      "2021)Subsurface oxygen content ‘is projected to \n",
      "transition to historically unprecedented condition \n",
      "with decline over the 21st century (medium \n",
      "confidence).’5.3.3.2 (Canadell et al. \n",
      "2021)\n",
      "Changes in nutrient \n",
      "concentrationsNot assessed in WGI Not assessed in WGI\n"
     ]
    }
   ],
   "source": [
    "print(vector_response.source_nodes[2].node.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So AMOC is disuccsed, but sadly it is in the middle chunk. With LLMs, it is often observed that text in the middle of retrieved context is often ignored or less useful. A recent paper [\"Lost in the Middle\" discusses this here](https://arxiv.org/abs/2307.03172)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Optional] Evaluation\n",
    "\n",
    "We more rigorously evaluate how well the sentence window retriever works compared to the base retriever.\n",
    "\n",
    "We define/load an eval benchmark dataset and then run different evaluations over it.\n",
    "\n",
    "**WARNING**: This can be *expensive*, especially with GPT-4. Use caution and tune the sample size to fit your budget."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset\n",
    "\n",
    "from llama_index.llms.openai import OpenAI\n",
    "import nest_asyncio\n",
    "import random\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "428"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(base_nodes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_nodes_eval = 30\n",
    "# there are 428 nodes total. Take the first 200 to generate questions (the back half of the doc is all references)\n",
    "sample_eval_nodes = random.sample(base_nodes[:200], num_nodes_eval)\n",
    "# NOTE: run this if the dataset isn't already saved\n",
    "# generate questions from the largest chunks (1024)\n",
    "dataset_generator = DatasetGenerator(\n",
    "    sample_eval_nodes,\n",
    "    llm=OpenAI(model=\"gpt-4\"),\n",
    "    show_progress=True,\n",
    "    num_questions_per_chunk=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_dataset.save_json(\"data/ipcc_eval_qr_dataset.json\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# optional\n",
    "eval_dataset = QueryResponseDataset.from_json(\"data/ipcc_eval_qr_dataset.json\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Compare Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import asyncio\n",
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.evaluation import (\n",
    "    CorrectnessEvaluator,\n",
    "    SemanticSimilarityEvaluator,\n",
    "    RelevancyEvaluator,\n",
    "    FaithfulnessEvaluator,\n",
    "    PairwiseComparisonEvaluator,\n",
    ")\n",
    "\n",
    "\n",
    "from collections import defaultdict\n",
    "import pandas as pd\n",
    "\n",
    "# NOTE: can uncomment other evaluators\n",
    "evaluator_c = CorrectnessEvaluator(llm=OpenAI(model=\"gpt-4\"))\n",
    "evaluator_s = SemanticSimilarityEvaluator()\n",
    "evaluator_r = RelevancyEvaluator(llm=OpenAI(model=\"gpt-4\"))\n",
    "evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model=\"gpt-4\"))\n",
    "# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model=\"gpt-4\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.evaluation.eval_utils import (\n",
    "    get_responses,\n",
    "    get_results_df,\n",
    ")\n",
    "from llama_index.core.evaluation import BatchEvalRunner\n",
    "\n",
    "max_samples = 30\n",
    "\n",
    "eval_qs = eval_dataset.questions\n",
    "ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]\n",
    "\n",
    "# resetup base query engine and sentence window query engine\n",
    "# base query engine\n",
    "base_query_engine = base_index.as_query_engine(similarity_top_k=2)\n",
    "# sentence window query engine\n",
    "query_engine = sentence_index.as_query_engine(\n",
    "    similarity_top_k=2,\n",
    "    # the target key defaults to `window` to match the node_parser's default\n",
    "    node_postprocessors=[\n",
    "        MetadataReplacementPostProcessor(target_metadata_key=\"window\")\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "base_pred_responses = get_responses(\n",
    "    eval_qs[:max_samples], base_query_engine, show_progress=True\n",
    ")\n",
    "pred_responses = get_responses(\n",
    "    eval_qs[:max_samples], query_engine, show_progress=True\n",
    ")\n",
    "\n",
    "pred_response_strs = [str(p) for p in pred_responses]\n",
    "base_pred_response_strs = [str(p) for p in base_pred_responses]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluator_dict = {\n",
    "    \"correctness\": evaluator_c,\n",
    "    \"faithfulness\": evaluator_f,\n",
    "    \"relevancy\": evaluator_r,\n",
    "    \"semantic_similarity\": evaluator_s,\n",
    "}\n",
    "batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run evaluations over faithfulness/semantic similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_results = await batch_runner.aevaluate_responses(\n",
    "    queries=eval_qs[:max_samples],\n",
    "    responses=pred_responses[:max_samples],\n",
    "    reference=ref_response_strs[:max_samples],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "base_eval_results = await batch_runner.aevaluate_responses(\n",
    "    queries=eval_qs[:max_samples],\n",
    "    responses=base_pred_responses[:max_samples],\n",
    "    reference=ref_response_strs[:max_samples],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>names</th>\n",
       "      <th>correctness</th>\n",
       "      <th>relevancy</th>\n",
       "      <th>faithfulness</th>\n",
       "      <th>semantic_similarity</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Sentence Window Retriever</td>\n",
       "      <td>4.366667</td>\n",
       "      <td>0.933333</td>\n",
       "      <td>0.933333</td>\n",
       "      <td>0.959583</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Base Retriever</td>\n",
       "      <td>4.216667</td>\n",
       "      <td>0.900000</td>\n",
       "      <td>0.933333</td>\n",
       "      <td>0.958664</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                       names  correctness  relevancy  faithfulness  \\\n",
       "0  Sentence Window Retriever     4.366667   0.933333      0.933333   \n",
       "1             Base Retriever     4.216667   0.900000      0.933333   \n",
       "\n",
       "   semantic_similarity  \n",
       "0             0.959583  \n",
       "1             0.958664  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "results_df = get_results_df(\n",
    "    [eval_results, base_eval_results],\n",
    "    [\"Sentence Window Retriever\", \"Base Retriever\"],\n",
    "    [\"correctness\", \"relevancy\", \"faithfulness\", \"semantic_similarity\"],\n",
    ")\n",
    "display(results_df)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
